Overview

Dataset Statistics

Number of Variables 12
Number of Rows 31762
Missing Cells 0
Missing Cells (%) 0.0%
Duplicate Rows 11203
Duplicate Rows (%) 35.3%
Total Size in Memory 2.9 MB
Average Row Size in Memory 96.0 B
Variable Types
  • Numerical: 12
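
The overview figures can be approximated with a few pandas calls. What follows is a minimal sketch under stated assumptions, not the profiler's own implementation: `overview_stats` is a hypothetical helper, the data is assumed to be loaded into a DataFrame already, and counting duplicates as "every repeat of an earlier row" is an assumption about how the report counts them. The reported 96.0 B average row size is consistent with 12 columns of 8-byte values, which suggests (but does not confirm) 64-bit floats.

```python
import pandas as pd

def overview_stats(df: pd.DataFrame) -> dict:
    """Summarize a DataFrame in the style of the overview table (hypothetical helper)."""
    n_rows, n_cols = df.shape
    n_cells = n_rows * n_cols
    missing = int(df.isna().sum().sum())
    # duplicated() flags each repeat of an earlier row; the first occurrence is not flagged
    duplicates = int(df.duplicated().sum())
    # index=False counts only column data, not the row index
    mem_bytes = int(df.memory_usage(index=False, deep=True).sum())
    return {
        "variables": n_cols,
        "rows": n_rows,
        "missing_cells": missing,
        "missing_pct": 100.0 * missing / n_cells if n_cells else 0.0,
        "duplicate_rows": duplicates,
        "duplicate_pct": 100.0 * duplicates / n_rows if n_rows else 0.0,
        "memory_bytes": mem_bytes,
        "avg_row_bytes": mem_bytes / n_rows if n_rows else 0.0,
    }
```

Calling `overview_stats(df)` on the profiled table should give figures comparable to those above, although the memory numbers depend on the column dtypes actually in use.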

Dataset Insights

jaro_distance is skewed
jaro_winkler_distance is skewed
overlap_coefficient_distance is skewed
soft_tfidf_distance is skewed
partial_ration_distance is skewed
Dataset has 11203 (35.27%) duplicate rows
levenshtain_distance has 2424 (7.63%) zeros
needleman_wunsch_distance has 2424 (7.63%) zeros
affine_gap_distance has 2424 (7.63%) zeros
smith_waterman_distance has 2424 (7.63%) zeros
jaro_winkler_distance has 21670 (68.23%) zeros
overlap_coefficient_distance has 9312 (29.32%) zeros
generalized_jaccard_distance has 2524 (7.95%) zeros
tfidf_distance has 2426 (7.64%) zeros
partial_ration_distance has 4888 (15.39%) zeros
bag_distance_distance has 2428 (7.64%) zeros
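
The zero counts in the insights above are simple per-column tallies expressed as a share of the 31762 rows (for example, 2424 / 31762 ≈ 7.63%). A minimal sketch of that tally, assuming the data is available as a pandas DataFrame; `zero_summary` is a hypothetical helper and not part of the profiling tool:

```python
import pandas as pd

def zero_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Count exact zeros per column and express them as a percentage of rows."""
    zeros = df.eq(0).sum()
    pct = (100.0 * zeros / len(df)).round(2)
    out = pd.DataFrame({"zeros": zeros, "zeros_pct": pct})
    # Keep only columns that contain at least one zero, largest count first
    return out[out["zeros"] > 0].sort_values("zeros", ascending=False)
```

Only exact zeros are counted; values that are merely close to zero are not, which matters for floating-point distance scores.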

Variables


levenshtain_distance

numerical

Approximate Distinct Count 1657
Approximate Unique (%) 5.2%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 508192
Mean 0.5174
Minimum 0
Maximum 0.9362
Zeros 2424
Zeros (%) 7.6%
Negatives 0
Negatives (%) 0.0%
  • levenshtain_distance is skewed left (γ1 = -1.1148)

Quantile Statistics

Minimum 0
5-th Percentile 0
Q1 0.4386
Median 0.5647
Q3 0.6571
95-th Percentile 0.7759
Maximum 0.9362
Range 0.9362
IQR 0.2185

Descriptive Statistics

Mean 0.5174
Standard Deviation 0.2073
Variance 0.04298
Sum 16432.5285
Skewness -1.1148
Kurtosis 0.794
Coefficient of Variation 0.4007
  • levenshtain_distance has 2644 outliers
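
Several figures in each variable block are derived from others: Range = Maximum − Minimum, IQR = Q3 − Q1, and Coefficient of Variation = Standard Deviation / Mean. The sketch below computes them with NumPy and also counts outliers using Tukey's 1.5 × IQR fences. The report does not state its outlier rule, so the fence rule is an assumption, as are the use of the sample standard deviation (`ddof=1`) and the helper name `derived_stats`.

```python
import numpy as np

def derived_stats(values) -> dict:
    """Range, IQR, coefficient of variation, and 1.5*IQR outlier count (hypothetical helper)."""
    x = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    # Assumed rule: points beyond 1.5 * IQR from the quartiles count as outliers
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    mean = x.mean()
    std = x.std(ddof=1)  # assumption: sample standard deviation
    return {
        "range": float(x.max() - x.min()),
        "iqr": float(iqr),
        "cv": float(std / mean) if mean != 0 else float("nan"),
        "outliers": int(((x < lower) | (x > upper)).sum()),
    }

# Consistency check on figures copied from the levenshtain_distance block
assert round(0.6571 - 0.4386, 4) == 0.2185   # IQR = Q3 - Q1
assert round(0.2073 / 0.5174, 4) == 0.4007   # CV = std / mean
```

Under the assumed fence rule, the lower fence for levenshtain_distance is 0.4386 − 1.5 × 0.2185 ≈ 0.111, so all 2424 zero values would fall below it; the upper fence (≈ 0.985) lies above the maximum of 0.9362.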

needleman_wunsch_distance

numerical

Approximate Distinct Count 3253
Approximate Unique (%) 10.2%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 508192
Mean 0.6796
Minimum 0
Maximum 1.3435
Zeros 2424
Zeros (%) 7.6%
Negatives 0
Negatives (%) 0.0%
  • needleman_wunsch_distance is skewed left (γ1 = -0.7639)

Quantile Statistics

Minimum 0
5-th Percentile 0
Q1 0.5556
Median 0.716
Q3 0.8605
95-th Percentile 1.0946
Maximum 1.3435
Range 1.3435
IQR 0.3049

Descriptive Statistics

Mean 0.6796
Standard Deviation 0.2867
Variance 0.08222
Sum 21584.7928
Skewness -0.7639
Kurtosis 0.4455
Coefficient of Variation 0.4219
  • needleman_wunsch_distance has 2514 outliers
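
The skewness (γ1) and kurtosis values describe the shape of each distribution: a negative γ1 means a longer left tail ("skewed left"), a positive one a longer right tail. Because some reported kurtosis values are negative, they must be excess kurtosis (a normal distribution scores 0). A minimal sketch of the moment-based definitions follows; whether the report applies a bias correction is not stated, so the uncorrected population formulas here are an assumption, and `shape_stats` is a hypothetical helper.

```python
import numpy as np

def shape_stats(values) -> dict:
    """Moment-based skewness and excess kurtosis (population formulas, no bias correction)."""
    x = np.asarray(values, dtype=float)
    d = x - x.mean()
    m2, m3, m4 = (d**2).mean(), (d**3).mean(), (d**4).mean()
    skew = m3 / m2**1.5            # gamma_1
    excess_kurt = m4 / m2**2 - 3.0  # 0 for a normal distribution
    direction = "left" if skew < 0 else "right" if skew > 0 else "none"
    return {"skewness": float(skew), "kurtosis": float(excess_kurt), "direction": direction}
```

The function assumes the input is not constant; a zero variance would make both ratios undefined.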

affine_gap_distance

numerical

Approximate Distinct Count 8946
Approximate Unique (%) 28.2%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 508192
Mean 0.5745
Minimum 0
Maximum 1.08
Zeros 2424
Zeros (%) 7.6%
Negatives 0
Negatives (%) 0.0%
  • affine_gap_distance is skewed left (γ1 = -0.9479)

Quantile Statistics

Minimum 0
5-th Percentile 0
Q1 0.4768
Median 0.6158
Q3 0.7281
95-th Percentile 0.8902
Maximum 1.08
Range 1.08
IQR 0.2513

Descriptive Statistics

Mean 0.5745
Standard Deviation 0.2351
Variance 0.05529
Sum 18246.8173
Skewness -0.9479
Kurtosis 0.6316
Coefficient of Variation 0.4093
  • affine_gap_distance has 2526 outliers

smith_waterman_distance

numerical

Approximate Distinct Count 1415
Approximate Unique (%) 4.5%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 508192
Mean 0.5856
Minimum 0
Maximum 0.913
Zeros 2424
Zeros (%) 7.6%
Negatives 0
Negatives (%) 0.0%
  • smith_waterman_distance is skewed left (γ1 = -1.4811)

Quantile Statistics

Minimum 0
5-th Percentile 0
Q1 0.5217
Median 0.6489
Q3 0.7286
95-th Percentile 0.8158
Maximum 0.913
Range 0.913
IQR 0.2068

Descriptive Statistics

Mean 0.5856
Standard Deviation 0.2183
Variance 0.04766
Sum 18599.7557
Skewness -1.4811
Kurtosis 1.5748
Coefficient of Variation 0.3728
  • smith_waterman_distance has 2874 outliers

jaro_distance

numerical

Approximate Distinct Count 9710
Approximate Unique (%) 30.6%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 508192
Mean 0.9862
Minimum 0.9091
Maximum 0.9955
Zeros 0
Zeros (%) 0.0%
Negatives 0
Negatives (%) 0.0%
  • jaro_distance is skewed left (γ1 = -3.5694)

Quantile Statistics

Minimum 0.9091
5-th Percentile 0.973
Q1 0.9848
Median 0.9881
Q3 0.9904
95-th Percentile 0.9929
Maximum 0.9955
Range 0.08639
IQR 0.005659

Descriptive Statistics

Mean 0.9862
Standard Deviation 0.007521
Variance 5.6571e-05
Sum 31323.6713
Skewness -3.5694
Kurtosis 21.3391
Coefficient of Variation 0.007627
  • jaro_distance is not normally distributed (p-value 1.613426079575541e-10)
  • jaro_distance has 2574 outliers

jaro_winkler_distance

numerical

Approximate Distinct Count 3675
Approximate Unique (%) 11.6%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 508192
Mean 0.107
Minimum 0
Maximum 0.5174
Zeros 21670
Zeros (%) 68.2%
Negatives 0
Negatives (%) 0.0%
  • jaro_winkler_distance is skewed right (γ1 = 0.9225)

Quantile Statistics

Minimum 0
5-th Percentile 0
Q1 0
Median 0
Q3 0.2862
95-th Percentile 0.3963
Maximum 0.5174
Range 0.5174
IQR 0.2862

Descriptive Statistics

Mean 0.107
Standard Deviation 0.1605
Variance 0.02576
Sum 3398.1375
Skewness 0.9225
Kurtosis -0.9713
Coefficient of Variation 1.5001
  • jaro_winkler_distance is not normally distributed (p-value 5.949394903578497e-25)

overlap_coefficient_distance

numerical

Approximate Distinct Count 67
Approximate Unique (%) 0.2%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 508192
Mean 0.2511
Minimum 0
Maximum 0.9
Zeros 9312
Zeros (%) 29.3%
Negatives 0
Negatives (%) 0.0%
  • overlap_coefficient_distance is skewed right (γ1 = 0.2437)

Quantile Statistics

Minimum 0
5-th Percentile 0
Q1 0
Median 0.2222
Q3 0.4286
95-th Percentile 0.6
Maximum 0.9
Range 0.9
IQR 0.4286

Descriptive Statistics

Mean 0.2511
Standard Deviation 0.206
Variance 0.04242
Sum 7974.1946
Skewness 0.2437
Kurtosis -1.0661
Coefficient of Variation 0.8204
  • overlap_coefficient_distance is not normally distributed (p-value 1.0419642302529725e-18)

generalized_jaccard_distance

numerical

Approximate Distinct Count 152
Approximate Unique (%) 0.5%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 508192
Mean 0.5561
Minimum 0
Maximum 0.9524
Zeros 2524
Zeros (%) 7.9%
Negatives 0
Negatives (%) 0.0%
  • generalized_jaccard_distance is skewed left (γ1 = -1.2735)

Quantile Statistics

Minimum 0
5-th Percentile 0
Q1 0.5
Median 0.6154
Q3 0.7059
95-th Percentile 0.8
Maximum 0.9524
Range 0.9524
IQR 0.2059

Descriptive Statistics

Mean 0.5561
Standard Deviation 0.22
Variance 0.04839
Sum 17662.9177
Skewness -1.2735
Kurtosis 0.9169
Coefficient of Variation 0.3956
  • generalized_jaccard_distance has 3152 outliers

tfidf_distance

numerical

Approximate Distinct Count 1002
Approximate Unique (%) 3.2%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 508192
Mean 0.6463
Minimum 0
Maximum 0.9755
Zeros 2426
Zeros (%) 7.6%
Negatives 0
Negatives (%) 0.0%
  • tfidf_distance is skewed left (γ1 = -1.5777)

Quantile Statistics

Minimum 0
5-th Percentile 0
Q1 0.5848
Median 0.7226
Q3 0.8075
95-th Percentile 0.8794
Maximum 0.9755
Range 0.9755
IQR 0.2228

Descriptive Statistics

Mean 0.6463
Standard Deviation 0.2374
Variance 0.05634
Sum 20528.4568
Skewness -1.5777
Kurtosis 1.7954
Coefficient of Variation 0.3672
  • tfidf_distance has 2710 outliers

soft_tfidf_distance

numerical

Approximate Distinct Count 18015
Approximate Unique (%) 56.7%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 508192
Mean 0.989
Minimum 0.9091
Maximum 0.9983
Zeros 0
Zeros (%) 0.0%
Negatives 0
Negatives (%) 0.0%
  • soft_tfidf_distance is skewed left (γ1 = -3.6864)

Quantile Statistics

Minimum 0.9091
5-th Percentile 0.9757
Q1 0.9877
Median 0.9912
Q3 0.9933
95-th Percentile 0.9956
Maximum 0.9983
Range 0.08926
IQR 0.00563

Descriptive Statistics

Mean 0.989
Standard Deviation 0.007666
Variance 5.8773e-05
Sum 31414.0481
Skewness -3.6864
Kurtosis 22.4816
Coefficient of Variation 0.007751
  • soft_tfidf_distance is not normally distributed (p-value 5.813513359351851e-12)
  • soft_tfidf_distance has 2590 outliers

partial_ration_distance

numerical

Approximate Distinct Count 74
Approximate Unique (%) 0.2%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 508192
Mean 0.2894
Minimum 0
Maximum 0.74
Zeros 4888
Zeros (%) 15.4%
Negatives 0
Negatives (%) 0.0%
  • partial_ration_distance is skewed left (γ1 = -0.4033)

Quantile Statistics

Minimum 0
5-th Percentile 0
Q1 0.15
Median 0.33
Q3 0.42
95-th Percentile 0.53
Maximum 0.74
Range 0.74
IQR 0.27

Descriptive Statistics

Mean 0.2894
Standard Deviation 0.1712
Variance 0.02931
Sum 9192.94
Skewness -0.4033
Kurtosis -0.9196
Coefficient of Variation 0.5915
  • partial_ration_distance is not normally distributed (p-value 2.3912383922905273e-14)

bag_distance_distance

numerical

Approximate Distinct Count 1780
Approximate Unique (%) 5.6%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 508192
Mean 0.4025
Minimum 0
Maximum 0.8957
Zeros 2428
Zeros (%) 7.6%
Negatives 0
Negatives (%) 0.0%
  • bag_distance_distance is skewed left (γ1 = -0.2354)

Quantile Statistics

Minimum 0
5-th Percentile 0
Q1 0.2857
Median 0.4103
Q3 0.5333
95-th Percentile 0.717
Maximum 0.8957
Range 0.8957
IQR 0.2476

Descriptive Statistics

Mean 0.4025
Standard Deviation 0.1943
Variance 0.03776
Sum 12784.2583
Skewness -0.2354
Kurtosis -0.2572
Coefficient of Variation 0.4828
  • bag_distance_distance is not normally distributed (p-value 0.0046412965358985266)

Interactions

Correlations

Missing Values